Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. https://machinelearningmastery.com/

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset poses a regression problem where we are trying to predict the value of a continuous variable.

INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the article’s popularity level in social networks. The dataset does not contain the original content, but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.

Many thanks to K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal, for making the dataset and benchmarking information available.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy result. Iteration Take1 established a baseline performance regarding accuracy and processing time.

For this iteration, we will examine the feasibility of a dimensionality reduction technique: ranking attribute importance with a gradient boosting tree method, keeping only the features that contribute to a cumulative importance of 0.99 (or 99%), and eliminating the rest.
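To illustrate the cutoff logic, here is a minimal base-R sketch. The importance scores below are made-up values for illustration only, not actual model output; in the script itself they would come from the gradient boosting model.

```r
# Hypothetical importance scores for illustration only (not actual model output)
imp <- c(kw_avg_avg = 0.50, self_reference_min_shares = 0.30,
         num_hrefs = 0.15, LDA_02 = 0.045,
         num_videos = 0.004, is_weekend = 0.001)

# Rank features by importance and compute the cumulative share of the total
imp_sorted <- sort(imp, decreasing = TRUE)
cum_imp <- cumsum(imp_sorted) / sum(imp_sorted)

# Keep every feature up to and including the first one that reaches 0.99
keep <- names(imp_sorted)[seq_len(which(cum_imp >= 0.99)[1])]
keep
```

With these toy scores, the last two features (num_videos and is_weekend) fall outside the 99% cumulative importance threshold and would be eliminated.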

ANALYSIS: From the previous iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 10446. Two algorithms (Random Forest and Stochastic Gradient Boosting) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data, with a best RMSE of 10299. Using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an RMSE of 12978, which was noticeably worse than the RMSE achieved on the training data and possibly due to over-fitting.

In the current iteration, the baseline performance of the machine learning algorithms achieved an average RMSE of 10409. Two algorithms (ElasticNet and Stochastic Gradient Boosting) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data, with a best RMSE of 10312. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an RMSE of 13007, which was worse than the RMSE achieved on the training data and possibly due to over-fitting.

From the model-building activities, the number of attributes went from 58 down to 35 after eliminating 23 attributes. The processing time went from 21 hours 7 minutes in iteration Take1 down to 11 hours 41 minutes in iteration Take2, a reduction of roughly 45% from Take1.
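The processing-time reduction can be verified with quick arithmetic on the run times quoted above:

```r
# Convert each iteration's run time to minutes
take1_minutes <- 21 * 60 + 7    # 1267 minutes for Take1
take2_minutes <- 11 * 60 + 41   # 701 minutes for Take2

# Relative reduction in processing time, as a percentage
round((take1_minutes - take2_minutes) / take1_minutes * 100, 1)
## [1] 44.7
```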

CONCLUSION: The feature selection technique helped by cutting down the number of attributes, which in turn reduced the training time. The modeling took a much shorter time to process yet retained a comparable level of accuracy. For this iteration, the Stochastic Gradient Boosting algorithm achieved the top training and validation results compared with the other machine learning algorithms. For this dataset, Stochastic Gradient Boosting should be considered for further modeling or production use.

Dataset Used: Online News Popularity Dataset

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The project aims to touch on the following areas:

  1. Document a predictive modeling problem end-to-end.
  2. Explore data cleaning and transformation options.
  3. Explore non-ensemble and ensemble algorithms for baseline model performance.
  4. Explore algorithm tuning techniques for improving model performance.

Any predictive modeling machine learning project can generally be broken down into about six major tasks:

  1. Prepare Problem
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

1. Prepare Problem

1.a) Load libraries

startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(parallel)
library(mailR)

# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)

1.b) Load dataset

originalDataset <- read.csv("OnlineNewsPopularity.csv", header = TRUE)

# Dropping the two non-predictive attributes: url and timedelta
originalDataset$url <- NULL
originalDataset$timedelta <- NULL

# Different ways of reading and processing the input dataset. Saving these for future reference.
#x_train <- read.fwf("X_train.txt", widths = widthVector, col.names = colNames)
#y_train <- read.csv("y_train.txt", header = FALSE, col.names = c("targetVar"))
#y_train$targetVar <- as.factor(y_train$targetVar)
#xy_train <- cbind(x_train, y_train)
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(originalDataset)

# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, be aware when slicing up the data frames for visualization!
targetCol <- totCol
colnames(originalDataset)[targetCol] <- "targetVar"
# We create training datasets (xy_train, x_train, y_train) for various operations.
# We create validation datasets (xy_test, x_test, y_test) for various operations.
set.seed(seedNum)

# Create a list of the rows in the original dataset we can use for training
training_index <- createDataPartition(originalDataset$targetVar, p=0.70, list=FALSE)
# Use 70% of the data to train the models and the remaining for testing/validation
xy_train <- originalDataset[training_index,]
xy_test <- originalDataset[-training_index,]

if (targetCol==1) {
  x_train <- xy_train[,(targetCol+1):totCol]
  y_train <- xy_train[,targetCol]
  y_test <- xy_test[,targetCol]
} else {
  x_train <- xy_train[,1:totAttr]
  y_train <- xy_train[,totCol]
  y_test <- xy_test[,totCol]
}

1.c) Set up the key parameters to be used in the script

# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 4
if (totAttr%%dispCol == 0) {
  dispRow <- totAttr%/%dispCol
} else {
  dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row):  4  by  15

1.d) Set test options and evaluation metric

# Run algorithms using 10-fold cross-validation (1 repeat)
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "RMSE"
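RMSE (root mean squared error) is the evaluation metric used throughout this project. For reference, it can be computed directly in base R; the share counts below are toy numbers for illustration, not values from this dataset:

```r
# RMSE: the square root of the mean of the squared residuals
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy example with made-up share counts
actual    <- c(711, 1500, 505, 855)
predicted <- c(800, 1400, 600, 900)
rmse(actual, predicted)
## [1] 85.10435
```

Lower RMSE values indicate predictions that are closer to the observed values, which is why the tuning trials above select the model with the smallest RMSE.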

1.e) Set up the email notification function

email_notify <- function(msg=""){
  sender <- "luozhi2488@gmail.com"
  receiver <- "dave@contactdavidlowe.com"
  sbj_line <- "Notification from R Script"
  password <- readLines("email_credential.txt")
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
email_notify(paste("Library and Data Loading Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@47fd17e3}"

2. Summarize Data

To gain a better understanding of the data that we have on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

2.a) Descriptive statistics

2.a.i) Peek at the data itself.

head(xy_train)
##   n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words
## 2              9              255       0.6047431                1
## 3              9              211       0.5751295                1
## 5             13             1072       0.4156456                1
## 6             10              370       0.5598886                1
## 7              8              960       0.4181626                1
## 8             12              989       0.4335736                1
##   n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos
## 2                0.7919463         3              1        1          0
## 3                0.6638655         3              1        1          0
## 5                0.5408895        19             19       20          0
## 6                0.6981982         2              2        0          0
## 7                0.5498339        21             20       20          0
## 8                0.5721078        20             20       20          0
##   average_token_length num_keywords data_channel_is_lifestyle
## 2             4.913725            4                         0
## 3             4.393365            6                         0
## 5             4.682836            7                         0
## 6             4.359459            9                         0
## 7             4.654167           10                         1
## 8             4.617796            9                         0
##   data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
## 2                             0                   1                      0
## 3                             0                   1                      0
## 5                             0                   0                      0
## 6                             0                   0                      0
## 7                             0                   0                      0
## 8                             0                   0                      0
##   data_channel_is_tech data_channel_is_world kw_min_min kw_max_min
## 2                    0                     0          0          0
## 3                    0                     0          0          0
## 5                    1                     0          0          0
## 6                    1                     0          0          0
## 7                    0                     0          0          0
## 8                    1                     0          0          0
##   kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg
## 2          0          0          0          0          0          0
## 3          0          0          0          0          0          0
## 5          0          0          0          0          0          0
## 6          0          0          0          0          0          0
## 7          0          0          0          0          0          0
## 8          0          0          0          0          0          0
##   kw_avg_avg self_reference_min_shares self_reference_max_shares
## 2          0                         0                         0
## 3          0                       918                       918
## 5          0                       545                     16000
## 6          0                      8500                      8500
## 7          0                       545                     16000
## 8          0                       545                     16000
##   self_reference_avg_sharess weekday_is_monday weekday_is_tuesday
## 2                      0.000                 1                  0
## 3                    918.000                 1                  0
## 5                   3151.158                 1                  0
## 6                   8500.000                 1                  0
## 7                   3151.158                 1                  0
## 8                   3151.158                 1                  0
##   weekday_is_wednesday weekday_is_thursday weekday_is_friday
## 2                    0                   0                 0
## 3                    0                   0                 0
## 5                    0                   0                 0
## 6                    0                   0                 0
## 7                    0                   0                 0
## 8                    0                   0                 0
##   weekday_is_saturday weekday_is_sunday is_weekend     LDA_00     LDA_01
## 2                   0                 0          0 0.79975569 0.05004668
## 3                   0                 0          0 0.21779229 0.03333446
## 5                   0                 0          0 0.02863281 0.02879355
## 6                   0                 0          0 0.02224528 0.30671758
## 7                   0                 0          0 0.02008167 0.11470539
## 8                   0                 0          0 0.02222436 0.15073297
##       LDA_02     LDA_03     LDA_04 global_subjectivity
## 2 0.05009625 0.05010067 0.05000071           0.3412458
## 3 0.03335142 0.03333354 0.68218829           0.7022222
## 5 0.02857518 0.02857168 0.88542678           0.5135021
## 6 0.02223128 0.02222429 0.62658158           0.4374086
## 7 0.02002437 0.02001533 0.82517325           0.5144803
## 8 0.24343548 0.02222360 0.56138359           0.5434742
##   global_sentiment_polarity global_rate_positive_words
## 2                0.14894781                 0.04313725
## 3                0.32333333                 0.05687204
## 5                0.28100348                 0.07462687
## 6                0.07118419                 0.02972973
## 7                0.26830272                 0.08020833
## 8                0.29861347                 0.08392315
##   global_rate_negative_words rate_positive_words rate_negative_words
## 2                0.015686275           0.7333333           0.2666667
## 3                0.009478673           0.8571429           0.1428571
## 5                0.012126866           0.8602151           0.1397849
## 6                0.027027027           0.5238095           0.4761905
## 7                0.016666667           0.8279570           0.1720430
## 8                0.015166835           0.8469388           0.1530612
##   avg_positive_polarity min_positive_polarity max_positive_polarity
## 2             0.2869146            0.03333333                   0.7
## 3             0.4958333            0.10000000                   1.0
## 5             0.4111274            0.03333333                   1.0
## 6             0.3506100            0.13636364                   0.6
## 7             0.4020386            0.10000000                   1.0
## 8             0.4277205            0.10000000                   1.0
##   avg_negative_polarity min_negative_polarity max_negative_polarity
## 2            -0.1187500                -0.125            -0.1000000
## 3            -0.4666667                -0.800            -0.1333333
## 5            -0.2201923                -0.500            -0.0500000
## 6            -0.1950000                -0.400            -0.1000000
## 7            -0.2244792                -0.500            -0.0500000
## 8            -0.2427778                -0.500            -0.0500000
##   title_subjectivity title_sentiment_polarity abs_title_subjectivity
## 2          0.0000000                0.0000000             0.50000000
## 3          0.0000000                0.0000000             0.50000000
## 5          0.4545455                0.1363636             0.04545455
## 6          0.6428571                0.2142857             0.14285714
## 7          0.0000000                0.0000000             0.50000000
## 8          1.0000000                0.5000000             0.50000000
##   abs_title_sentiment_polarity targetVar
## 2                    0.0000000       711
## 3                    0.0000000      1500
## 5                    0.1363636       505
## 6                    0.2142857       855
## 7                    0.0000000       556
## 8                    0.5000000       891

2.a.ii) Dimensions of the dataset.

dim(xy_train)
## [1] 27752    59
dim(xy_test)
## [1] 11892    59

2.a.iii) Types of the attributes.

sapply(xy_train, class)
##                n_tokens_title              n_tokens_content 
##                     "numeric"                     "numeric" 
##               n_unique_tokens              n_non_stop_words 
##                     "numeric"                     "numeric" 
##      n_non_stop_unique_tokens                     num_hrefs 
##                     "numeric"                     "numeric" 
##                num_self_hrefs                      num_imgs 
##                     "numeric"                     "numeric" 
##                    num_videos          average_token_length 
##                     "numeric"                     "numeric" 
##                  num_keywords     data_channel_is_lifestyle 
##                     "numeric"                     "numeric" 
## data_channel_is_entertainment           data_channel_is_bus 
##                     "numeric"                     "numeric" 
##        data_channel_is_socmed          data_channel_is_tech 
##                     "numeric"                     "numeric" 
##         data_channel_is_world                    kw_min_min 
##                     "numeric"                     "numeric" 
##                    kw_max_min                    kw_avg_min 
##                     "numeric"                     "numeric" 
##                    kw_min_max                    kw_max_max 
##                     "numeric"                     "numeric" 
##                    kw_avg_max                    kw_min_avg 
##                     "numeric"                     "numeric" 
##                    kw_max_avg                    kw_avg_avg 
##                     "numeric"                     "numeric" 
##     self_reference_min_shares     self_reference_max_shares 
##                     "numeric"                     "numeric" 
##    self_reference_avg_sharess             weekday_is_monday 
##                     "numeric"                     "numeric" 
##            weekday_is_tuesday          weekday_is_wednesday 
##                     "numeric"                     "numeric" 
##           weekday_is_thursday             weekday_is_friday 
##                     "numeric"                     "numeric" 
##           weekday_is_saturday             weekday_is_sunday 
##                     "numeric"                     "numeric" 
##                    is_weekend                        LDA_00 
##                     "numeric"                     "numeric" 
##                        LDA_01                        LDA_02 
##                     "numeric"                     "numeric" 
##                        LDA_03                        LDA_04 
##                     "numeric"                     "numeric" 
##           global_subjectivity     global_sentiment_polarity 
##                     "numeric"                     "numeric" 
##    global_rate_positive_words    global_rate_negative_words 
##                     "numeric"                     "numeric" 
##           rate_positive_words           rate_negative_words 
##                     "numeric"                     "numeric" 
##         avg_positive_polarity         min_positive_polarity 
##                     "numeric"                     "numeric" 
##         max_positive_polarity         avg_negative_polarity 
##                     "numeric"                     "numeric" 
##         min_negative_polarity         max_negative_polarity 
##                     "numeric"                     "numeric" 
##            title_subjectivity      title_sentiment_polarity 
##                     "numeric"                     "numeric" 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                     "numeric"                     "numeric" 
##                     targetVar 
##                     "integer"

2.a.iv) Statistical summary of all attributes.

summary(xy_train)
##  n_tokens_title n_tokens_content n_unique_tokens    n_non_stop_words  
##  Min.   : 3.0   Min.   :   0.0   Min.   :  0.0000   Min.   :   0.000  
##  1st Qu.: 9.0   1st Qu.: 246.0   1st Qu.:  0.4707   1st Qu.:   1.000  
##  Median :10.0   Median : 409.0   Median :  0.5393   Median :   1.000  
##  Mean   :10.4   Mean   : 547.2   Mean   :  0.5555   Mean   :   1.008  
##  3rd Qu.:12.0   3rd Qu.: 716.0   3rd Qu.:  0.6081   3rd Qu.:   1.000  
##  Max.   :23.0   Max.   :8474.0   Max.   :701.0000   Max.   :1042.000  
##  n_non_stop_unique_tokens   num_hrefs      num_self_hrefs  
##  Min.   :  0.0000         Min.   :  0.00   Min.   : 0.000  
##  1st Qu.:  0.6255         1st Qu.:  4.00   1st Qu.: 1.000  
##  Median :  0.6903         Median :  7.00   Median : 3.000  
##  Mean   :  0.6957         Mean   : 10.88   Mean   : 3.302  
##  3rd Qu.:  0.7542         3rd Qu.: 14.00   3rd Qu.: 4.000  
##  Max.   :650.0000         Max.   :304.00   Max.   :74.000  
##     num_imgs         num_videos     average_token_length  num_keywords   
##  Min.   :  0.000   Min.   : 0.000   Min.   :0.000        Min.   : 1.000  
##  1st Qu.:  1.000   1st Qu.: 0.000   1st Qu.:4.477        1st Qu.: 6.000  
##  Median :  1.000   Median : 0.000   Median :4.662        Median : 7.000  
##  Mean   :  4.563   Mean   : 1.262   Mean   :4.546        Mean   : 7.227  
##  3rd Qu.:  4.000   3rd Qu.: 1.000   3rd Qu.:4.854        3rd Qu.: 9.000  
##  Max.   :111.000   Max.   :91.000   Max.   :6.610        Max.   :10.000  
##  data_channel_is_lifestyle data_channel_is_entertainment
##  Min.   :0.00000           Min.   :0.000                
##  1st Qu.:0.00000           1st Qu.:0.000                
##  Median :0.00000           Median :0.000                
##  Mean   :0.05387           Mean   :0.178                
##  3rd Qu.:0.00000           3rd Qu.:0.000                
##  Max.   :1.00000           Max.   :1.000                
##  data_channel_is_bus data_channel_is_socmed data_channel_is_tech
##  Min.   :0.0000      Min.   :0.00000        Min.   :0.0000      
##  1st Qu.:0.0000      1st Qu.:0.00000        1st Qu.:0.0000      
##  Median :0.0000      Median :0.00000        Median :0.0000      
##  Mean   :0.1579      Mean   :0.05801        Mean   :0.1864      
##  3rd Qu.:0.0000      3rd Qu.:0.00000        3rd Qu.:0.0000      
##  Max.   :1.0000      Max.   :1.00000        Max.   :1.0000      
##  data_channel_is_world   kw_min_min       kw_max_min       kw_avg_min     
##  Min.   :0.0000        Min.   : -1.00   Min.   :     0   Min.   :   -1.0  
##  1st Qu.:0.0000        1st Qu.: -1.00   1st Qu.:   450   1st Qu.:  141.9  
##  Median :0.0000        Median : -1.00   Median :   662   Median :  235.1  
##  Mean   :0.2092        Mean   : 26.13   Mean   :  1159   Mean   :  313.8  
##  3rd Qu.:0.0000        3rd Qu.:  4.00   3rd Qu.:  1000   3rd Qu.:  356.8  
##  Max.   :1.0000        Max.   :377.00   Max.   :298400   Max.   :42827.9  
##    kw_min_max       kw_max_max       kw_avg_max       kw_min_avg  
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :  -1  
##  1st Qu.:     0   1st Qu.:843300   1st Qu.:172048   1st Qu.:   0  
##  Median :  1400   Median :843300   Median :245025   Median :1034  
##  Mean   : 13458   Mean   :752066   Mean   :259524   Mean   :1122  
##  3rd Qu.:  7900   3rd Qu.:843300   3rd Qu.:331986   3rd Qu.:2066  
##  Max.   :843300   Max.   :843300   Max.   :843300   Max.   :3613  
##    kw_max_avg       kw_avg_avg    self_reference_min_shares
##  Min.   :     0   Min.   :    0   Min.   :     0           
##  1st Qu.:  3564   1st Qu.: 2386   1st Qu.:   638           
##  Median :  4358   Median : 2870   Median :  1200           
##  Mean   :  5640   Mean   : 3137   Mean   :  4084           
##  3rd Qu.:  6021   3rd Qu.: 3605   3rd Qu.:  2600           
##  Max.   :298400   Max.   :43568   Max.   :843300           
##  self_reference_max_shares self_reference_avg_sharess weekday_is_monday
##  Min.   :     0            Min.   :     0             Min.   :0.0000   
##  1st Qu.:  1100            1st Qu.:   985             1st Qu.:0.0000   
##  Median :  2800            Median :  2200             Median :0.0000   
##  Mean   : 10164            Mean   :  6380             Mean   :0.1689   
##  3rd Qu.:  7900            3rd Qu.:  5100             3rd Qu.:0.0000   
##  Max.   :843300            Max.   :843300             Max.   :1.0000   
##  weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
##  Min.   :0.0000     Min.   :0.0000       Min.   :0.0000     
##  1st Qu.:0.0000     1st Qu.:0.0000       1st Qu.:0.0000     
##  Median :0.0000     Median :0.0000       Median :0.0000     
##  Mean   :0.1865     Mean   :0.1886       Mean   :0.1833     
##  3rd Qu.:0.0000     3rd Qu.:0.0000       3rd Qu.:0.0000     
##  Max.   :1.0000     Max.   :1.0000       Max.   :1.0000     
##  weekday_is_friday weekday_is_saturday weekday_is_sunday   is_weekend    
##  Min.   :0.0000    Min.   :0.00000     Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000    1st Qu.:0.00000     1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000    Median :0.00000     Median :0.00000   Median :0.0000  
##  Mean   :0.1434    Mean   :0.06191     Mean   :0.06735   Mean   :0.1293  
##  3rd Qu.:0.0000    3rd Qu.:0.00000     3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000    Max.   :1.00000     Max.   :1.00000   Max.   :1.0000  
##      LDA_00            LDA_01            LDA_02            LDA_03       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.02505   1st Qu.:0.02501   1st Qu.:0.02857   1st Qu.:0.02857  
##  Median :0.03339   Median :0.03334   Median :0.04000   Median :0.04000  
##  Mean   :0.18415   Mean   :0.14087   Mean   :0.21465   Mean   :0.22515  
##  3rd Qu.:0.24039   3rd Qu.:0.15034   3rd Qu.:0.32802   3rd Qu.:0.38152  
##  Max.   :0.92699   Max.   :0.92595   Max.   :0.92000   Max.   :0.91998  
##      LDA_04        global_subjectivity global_sentiment_polarity
##  Min.   :0.00000   Min.   :0.0000      Min.   :-0.38021         
##  1st Qu.:0.02857   1st Qu.:0.3955      1st Qu.: 0.05712         
##  Median :0.04073   Median :0.4534      Median : 0.11867         
##  Mean   :0.23514   Mean   :0.4430      Mean   : 0.11861         
##  3rd Qu.:0.40359   3rd Qu.:0.5083      3rd Qu.: 0.17700         
##  Max.   :0.92712   Max.   :1.0000      Max.   : 0.65500         
##  global_rate_positive_words global_rate_negative_words rate_positive_words
##  Min.   :0.00000            Min.   :0.000000           Min.   :0.0000     
##  1st Qu.:0.02834            1st Qu.:0.009662           1st Qu.:0.6000     
##  Median :0.03888            Median :0.015326           Median :0.7097     
##  Mean   :0.03955            Mean   :0.016647           Mean   :0.6815     
##  3rd Qu.:0.05025            3rd Qu.:0.021739           3rd Qu.:0.8000     
##  Max.   :0.15217            Max.   :0.184932           Max.   :1.0000     
##  rate_negative_words avg_positive_polarity min_positive_polarity
##  Min.   :0.0000      Min.   :0.0000        Min.   :0.00000      
##  1st Qu.:0.1852      1st Qu.:0.3056        1st Qu.:0.05000      
##  Median :0.2800      Median :0.3583        Median :0.10000      
##  Mean   :0.2884      Mean   :0.3532        Mean   :0.09536      
##  3rd Qu.:0.3846      3rd Qu.:0.4108        3rd Qu.:0.10000      
##  Max.   :1.0000      Max.   :1.0000        Max.   :1.00000      
##  max_positive_polarity avg_negative_polarity min_negative_polarity
##  Min.   :0.0000        Min.   :-1.0000       Min.   :-1.0000      
##  1st Qu.:0.6000        1st Qu.:-0.3282       1st Qu.:-0.7000      
##  Median :0.8000        Median :-0.2536       Median :-0.5000      
##  Mean   :0.7553        Mean   :-0.2596       Mean   :-0.5222      
##  3rd Qu.:1.0000        3rd Qu.:-0.1873       3rd Qu.:-0.3000      
##  Max.   :1.0000        Max.   : 0.0000       Max.   : 0.0000      
##  max_negative_polarity title_subjectivity title_sentiment_polarity
##  Min.   :-1.0000       Min.   :0.0000     Min.   :-1.00000        
##  1st Qu.:-0.1250       1st Qu.:0.0000     1st Qu.: 0.00000        
##  Median :-0.1000       Median :0.1429     Median : 0.00000        
##  Mean   :-0.1073       Mean   :0.2819     Mean   : 0.07093        
##  3rd Qu.:-0.0500       3rd Qu.:0.5000     3rd Qu.: 0.13750        
##  Max.   : 0.0000       Max.   :1.0000     Max.   : 1.00000        
##  abs_title_subjectivity abs_title_sentiment_polarity   targetVar     
##  Min.   :0.0000         Min.   :0.0000               Min.   :     4  
##  1st Qu.:0.1667         1st Qu.:0.0000               1st Qu.:   946  
##  Median :0.5000         Median :0.0000               Median :  1400  
##  Mean   :0.3419         Mean   :0.1558               Mean   :  3366  
##  3rd Qu.:0.5000         3rd Qu.:0.2500               3rd Qu.:  2800  
##  Max.   :0.5000         Max.   :1.0000               Max.   :690400

2.a.v) Count missing values.

sapply(xy_train, function(x) sum(is.na(x)))
##                n_tokens_title              n_tokens_content 
##                             0                             0 
##               n_unique_tokens              n_non_stop_words 
##                             0                             0 
##      n_non_stop_unique_tokens                     num_hrefs 
##                             0                             0 
##                num_self_hrefs                      num_imgs 
##                             0                             0 
##                    num_videos          average_token_length 
##                             0                             0 
##                  num_keywords     data_channel_is_lifestyle 
##                             0                             0 
## data_channel_is_entertainment           data_channel_is_bus 
##                             0                             0 
##        data_channel_is_socmed          data_channel_is_tech 
##                             0                             0 
##         data_channel_is_world                    kw_min_min 
##                             0                             0 
##                    kw_max_min                    kw_avg_min 
##                             0                             0 
##                    kw_min_max                    kw_max_max 
##                             0                             0 
##                    kw_avg_max                    kw_min_avg 
##                             0                             0 
##                    kw_max_avg                    kw_avg_avg 
##                             0                             0 
##     self_reference_min_shares     self_reference_max_shares 
##                             0                             0 
##    self_reference_avg_sharess             weekday_is_monday 
##                             0                             0 
##            weekday_is_tuesday          weekday_is_wednesday 
##                             0                             0 
##           weekday_is_thursday             weekday_is_friday 
##                             0                             0 
##           weekday_is_saturday             weekday_is_sunday 
##                             0                             0 
##                    is_weekend                        LDA_00 
##                             0                             0 
##                        LDA_01                        LDA_02 
##                             0                             0 
##                        LDA_03                        LDA_04 
##                             0                             0 
##           global_subjectivity     global_sentiment_polarity 
##                             0                             0 
##    global_rate_positive_words    global_rate_negative_words 
##                             0                             0 
##           rate_positive_words           rate_negative_words 
##                             0                             0 
##         avg_positive_polarity         min_positive_polarity 
##                             0                             0 
##         max_positive_polarity         avg_negative_polarity 
##                             0                             0 
##         min_negative_polarity         max_negative_polarity 
##                             0                             0 
##            title_subjectivity      title_sentiment_polarity 
##                             0                             0 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                             0                             0 
##                     targetVar 
##                             0

2.b) Data visualizations

# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    boxplot(x_train[,i], main=names(x_train)[i])
}

# Histograms each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    hist(x_train[,i], main=names(x_train)[i])
}

# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    plot(density(x_train[,i]), main=names(x_train)[i])
}

# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")

email_notify(paste("Data Summary and Visualization Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@34340fab}"

3. Prepare Data

Some dataset may require additional preparation activities that will best exposes the structure of the problem and the relationships between the input attributes and the output variable. Some data-prep tasks might include:

3.a) Data Cleaning

# Not applicable for this iteration of the project.

# Mark missing values
#invalid <- 0
#entireDataset$some_col[entireDataset$some_col==invalid] <- NA

# Impute missing values
#entireDataset$some_col <- with(entireDataset, impute(some_col, mean))

3.b) Feature Selection

# Using the Lasso algorithm, we try to rank the attributes' importance.
startTimeModule <- proc.time()
set.seed(seedNum)
model_fs <- train(targetVar~., data=xy_train, method="lasso", preProcess="scale", trControl=control)
rankedImportance <- varImp(model_fs, scale=FALSE)
print(rankedImportance)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 58)
## 
##                                Overall
## kw_avg_avg                   0.0156396
## LDA_03                       0.0079458
## kw_min_avg                   0.0068456
## global_subjectivity          0.0044389
## LDA_02                       0.0040366
## data_channel_is_world        0.0025531
## avg_negative_polarity        0.0024768
## kw_avg_max                   0.0024501
## num_hrefs                    0.0023209
## avg_positive_polarity        0.0017638
## kw_max_min                   0.0016875
## num_imgs                     0.0016797
## self_reference_max_shares    0.0014199
## self_reference_min_shares    0.0011945
## average_token_length         0.0010510
## LDA_04                       0.0009030
## global_rate_positive_words   0.0008454
## LDA_01                       0.0007778
## global_sentiment_polarity    0.0006893
## abs_title_sentiment_polarity 0.0006887
plot(rankedImportance)

# Set the importance threshold and calculate the list of attributes that don't contribute to the importance threshold
maxThreshold <- 0.99
rankedAttributes <- rankedImportance$importance
rankedAttributes <- rankedAttributes[order(-rankedAttributes$Overall),,drop=FALSE]
totalWeight <- sum(rankedAttributes)
i <- 1
accumWeight <- 0
exit_now <- FALSE
while ((i <= totAttr) & !exit_now) {
  accumWeight = accumWeight + rankedAttributes[i,]
  if ((accumWeight/totalWeight) >= maxThreshold) {
    exit_now <- TRUE
  } else {
    i <- i + 1
  }
}
lowImportance <- rankedAttributes[(i+1):(totAttr),,drop=FALSE]
lowAttributes <- rownames(lowImportance)
cat('Number of attributes contributed to the importance threshold:',i,"\n")
## Number of attributes contributed to the importance threshold: 35
cat('Number of attributes found to be of low importance:',length(lowAttributes))
## Number of attributes found to be of low importance: 23
# Removing the unselected attributes from the training and validation dataframes
xy_train <- xy_train[, !(names(xy_train) %in% lowAttributes)]
xy_test <- xy_test[, !(names(xy_test) %in% lowAttributes)]

3.c) Data Transforms

# Not applicable for this iteration of the project.
proc.time()-startTimeScript
##    user  system elapsed 
## 162.607   1.171 168.144
email_notify(paste("Data Cleaning and Transformation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@546a03af}"

4. Model and Evaluate Algorithms

After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the dataset. The typical evaluation tasks include:

For this project, we will evaluate four linear, three non-linear, and three ensemble algorithms:

Linear Algorithms: Linear Regression, Ridge, LASSO, and ElasticNet

Non-Linear Algorithms: Decision Trees (CART), k-Nearest Neighbors, and Support Vector Machine

Ensemble Algorithms: Bagged CART, Random Forest, and Stochastic Gradient Boosting

The random number seed is reset before each run to ensure that the evaluation of each algorithm is performed using the same data splits. It ensures the results are directly comparable.

4.a) Generate models using linear algorithms

# Linear Regression (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lm <- train(targetVar~., data=xy_train, method="lm", metric=metricTarget, trControl=control)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
print(fit.lm)
## Linear Regression 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   10328.58  0.02489697  3015.021
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
proc.time()-startTimeModule
##    user  system elapsed 
##   2.914   0.000   2.946
email_notify(paste("Linear Regression Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2aaf7cc2}"
# Ridge (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.ridge <- train(targetVar~., data=xy_train, method="ridge", metric=metricTarget, trControl=control)
print(fit.ridge)
## Ridge Regression 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared    MAE     
##   0e+00   10328.57  0.02489726  3015.014
##   1e-04   10328.56  0.02489923  3015.051
##   1e-01   10325.85  0.02538533  3017.792
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
proc.time()-startTimeModule
##    user  system elapsed 
##  17.289   0.131  17.609
email_notify(paste("Ridge Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@357246de}"
# lasso (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lasso <- train(targetVar~., data=xy_train, method="lasso", metric=metricTarget, trControl=control)
print(fit.lasso)
## The lasso 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared    MAE     
##   0.1       10342.91  0.02501161  3056.274
##   0.5       10328.83  0.02486268  3015.212
##   0.9       10328.48  0.02490989  3014.887
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.9.
proc.time()-startTimeModule
##    user  system elapsed 
##   7.005   0.076   7.158
email_notify(paste("Lasso Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1b40d5f0}"
# ElasticNet (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.en <- train(targetVar~., data=xy_train, method="enet", metric=metricTarget, trControl=control)
print(fit.en)
## Elasticnet 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared    MAE     
##   0e+00   0.050     10371.71  0.02371949  3088.671
##   0e+00   0.525     10329.25  0.02481668  3015.227
##   0e+00   1.000     10328.57  0.02489726  3015.014
##   1e-04   0.050     10384.69  0.02379999  3102.601
##   1e-04   0.525     10322.53  0.02601962  3013.107
##   1e-04   1.000     10328.56  0.02489923  3015.051
##   1e-01   0.050     10401.74  0.02379999  3120.703
##   1e-01   0.525     10320.52  0.02709483  3020.496
##   1e-01   1.000     10325.85  0.02538533  3017.792
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.525 and lambda = 0.1.
proc.time()-startTimeModule
##    user  system elapsed 
##  17.513   0.150  17.878
email_notify(paste("ElasticNet Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@19bb089b}"

4.b) Generate models using nonlinear algorithms

# Decision Tree - CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
print(fit.cart)
## CART 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   cp           RMSE      Rsquared     MAE     
##   0.005801686  10922.39  0.007524259  3117.510
##   0.009380481  10590.42  0.012998322  3114.704
##   0.012215379  10420.25  0.006983680  3132.171
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01221538.
proc.time()-startTimeModule
##    user  system elapsed 
##  13.289   0.050  13.490
email_notify(paste("Decision Tree Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@7e0babb1}"
# k-Nearest Neighbors (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.knn <- train(targetVar~., data=xy_train, method="knn", metric=metricTarget, trControl=control)
print(fit.knn)
## k-Nearest Neighbors 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared     MAE     
##   5  11421.11  0.002514543  3300.808
##   7  11085.90  0.002523161  3223.935
##   9  10917.32  0.003106696  3176.312
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
proc.time()-startTimeModule
##    user  system elapsed 
##  96.098   0.335  97.459
email_notify(paste("k-Nearest Neighbors Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3cb5cdba}"
# Support Vector Machine (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.svm <- train(targetVar~., data=xy_train, method="svmRadial", metric=metricTarget, trControl=control)
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared    MAE     
##   0.25  10431.30  0.02635699  2455.264
##   0.50  10421.95  0.02536113  2465.127
##   1.00  10410.31  0.02436759  2481.487
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02147857
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02147857 and C = 1.
proc.time()-startTimeModule
##     user   system  elapsed 
## 6585.295   10.086 6667.354
email_notify(paste("Support Vector Machine Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2f7c7260}"

4.c) Generate models using ensemble algorithms

In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.

# Bagged CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   10404.55  0.01325957  3064.731
proc.time()-startTimeModule
##    user  system elapsed 
##  89.181   2.933  93.093
email_notify(paste("Bagged CART Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@58ceff1}"
# Random Forest (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared    MAE     
##    2    10323.28  0.02802278  3108.021
##   18    10553.97  0.01892718  3302.205
##   35    10802.63  0.01422926  3346.586
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##      user    system   elapsed 
## 34385.210    45.538 34785.392
email_notify(paste("Random Forest Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@34ce8af7}"
# Stochastic Gradient Boosting (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared    MAE     
##   1                   50      10314.44  0.02777950  3016.218
##   1                  100      10313.25  0.02779096  3012.786
##   1                  150      10312.43  0.02816375  3005.185
##   2                   50      10328.06  0.02649566  3025.374
##   2                  100      10378.56  0.02076816  3046.568
##   2                  150      10420.67  0.01738631  3058.115
##   3                   50      10391.75  0.01876778  3049.711
##   3                  100      10452.19  0.01656918  3066.063
##   3                  150      10499.50  0.01556987  3085.789
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
##    user  system elapsed 
## 132.915   0.344 134.583
email_notify(paste("Stochastic Gradient Boosting Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@19dfb72a}"

4.d) Compare baseline algorithms

results <- resamples(list(LR=fit.lm, RIDGE=fit.ridge, LASSO=fit.lasso, EN=fit.en, CART=fit.cart, kNN=fit.knn, SVM=fit.svm, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: LR, RIDGE, LASSO, EN, CART, kNN, SVM, BagCART, RF, GBM 
## Number of resamples: 10 
## 
## MAE 
##             Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## LR      2867.533 2898.676 3003.605 3015.021 3100.108 3249.028    0
## RIDGE   2872.465 2903.250 3005.792 3017.792 3102.082 3249.043    0
## LASSO   2867.545 2898.695 3003.617 3014.887 3100.119 3247.568    0
## EN      2867.282 2906.992 3014.840 3020.496 3106.900 3257.336    0
## CART    2963.559 3010.490 3128.610 3132.171 3234.319 3376.677    0
## kNN     2989.525 3058.993 3207.779 3176.312 3288.264 3308.627    0
## SVM     2276.759 2379.823 2494.093 2481.487 2580.222 2711.565    0
## BagCART 2892.971 2981.674 3070.087 3064.731 3154.047 3302.777    0
## RF      2964.433 3018.120 3062.110 3108.021 3214.322 3335.522    0
## GBM     2832.728 2888.009 2995.864 3005.185 3095.759 3250.379    0
## 
## RMSE 
##             Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## LR      6500.165 7483.286 9344.529 10328.58 13457.40 15714.68    0
## RIDGE   6480.442 7476.860 9344.497 10325.85 13458.44 15713.57    0
## LASSO   6500.103 7483.260 9344.522 10328.48 13457.41 15713.84    0
## EN      6459.913 7465.302 9343.828 10320.52 13463.61 15727.56    0
## CART    6553.440 7633.558 9437.157 10420.25 13528.36 15794.70    0
## kNN     7465.402 8209.274 9973.133 10917.32 13908.63 15882.61    0
## SVM     6478.875 7597.444 9431.716 10410.31 13561.20 15801.25    0
## BagCART 6556.579 7680.579 9395.887 10404.55 13495.49 15776.10    0
## RF      6445.859 7495.870 9357.151 10323.28 13453.04 15717.30    0
## GBM     6401.081 7495.524 9331.630 10312.43 13466.16 15724.04    0
## 
## Rsquared 
##                 Min.     1st Qu.      Median        Mean     3rd Qu.
## LR      0.0097929291 0.013365658 0.025284219 0.024896971 0.035262824
## RIDGE   0.0098243238 0.013363799 0.026825695 0.025385333 0.035926039
## LASSO   0.0097930817 0.013364769 0.025287439 0.024909894 0.035264591
## EN      0.0089120856 0.012600546 0.026574737 0.027094829 0.034270812
## CART    0.0018031994 0.004940545 0.006535587 0.006983680 0.008578722
## kNN     0.0003221285 0.002279891 0.003423772 0.003106696 0.004598293
## SVM     0.0085281696 0.014997980 0.021215927 0.024367590 0.031309812
## BagCART 0.0039707603 0.009297588 0.012881767 0.013259566 0.016808234
## RF      0.0102782261 0.014118012 0.028663337 0.028022785 0.037699586
## GBM     0.0094031191 0.013355254 0.029837365 0.028163752 0.039991507
##                Max. NA's
## LR      0.041759597    0
## RIDGE   0.041600800    0
## LASSO   0.041756295    0
## EN      0.062996646    0
## CART    0.013060346    6
## kNN     0.004993074    0
## SVM     0.051017983    0
## BagCART 0.027560021    0
## RF      0.053680776    0
## GBM     0.048722802    0
dotplot(results)

cat('The average RMSE from all models is:',
    mean(c(results$values$`LR~RMSE`, results$values$`RIDGE~RMSE`, results$values$`LASSO~RMSE`, results$values$`EN~RMSE`, results$values$`CART~RMSE`, results$values$`kNN~RMSE`, results$values$`SVM~RMSE`, results$values$`BagCART~RMSE`, results$values$`RF~RMSE`, results$values$`GBM~RMSE`)))
## The average RMSE from all models is: 10409.16
email_notify(paste("Baseline Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@71be98f5}"

5. Improve Accuracy or Results

After we achieve a short list of machine learning algorithms with good level of accuracy, we can leverage ways to improve the accuracy of the models.

Using the two best-perfoming algorithms from the previous section, we will Search for a combination of parameters for each algorithm that yields the best results.

5.a) Algorithm Tuning

Finally, we will tune the best-performing algorithms from each group further and see whether we can get more accuracy out of them.

# Tuning algorithm #1 - ElasticNet
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(lambda=c(0.1,0.01,0.001), fraction=c(0.25,0.5,1.0))
fit.final1 <- train(targetVar~., data=xy_train, method="enet", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)

print(fit.final1)
## Elasticnet 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared    MAE     
##   0.001   0.25      10333.48  0.02609749  3042.355
##   0.001   0.50      10320.01  0.02701972  3017.255
##   0.001   1.00      10328.44  0.02491409  3015.079
##   0.010   0.25      10335.26  0.02587109  3044.921
##   0.010   0.50      10320.38  0.02707263  3018.903
##   0.010   1.00      10327.58  0.02503490  3015.071
##   0.100   0.25      10338.26  0.02534705  3049.281
##   0.100   0.50      10321.14  0.02707121  3021.807
##   0.100   1.00      10325.85  0.02538533  3017.792
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.5 and lambda = 0.001.
proc.time()-startTimeModule
##    user  system elapsed 
##  13.722   0.011  13.882
email_notify(paste("Algorithm #1 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@506e6d5e}"
# Tuning algorithm #2 - Stochastic Gradient Boostin
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(.n.trees=c(50,100,150,200), .shrinkage=0.1, .interaction.depth=1, .n.minobsinnode=10)
fit.final2 <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)

print(fit.final2)
## Stochastic Gradient Boosting 
## 
## 27752 samples
##    35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   n.trees  RMSE      Rsquared    MAE     
##    50      10314.27  0.02775123  3016.786
##   100      10312.53  0.02799842  3008.390
##   150      10312.85  0.02818552  3007.201
##   200      10315.19  0.02780375  3009.148
## 
## Tuning parameter 'interaction.depth' was held constant at a value of
##  1
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
##    user  system elapsed 
##  34.545   0.003  34.888
email_notify(paste("Algorithm #2 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1a407d53}"

5.d) Compare Algorithms After Tuning

results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, GBM 
## Number of resamples: 10 
## 
## MAE 
##         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## RF  2863.751 2904.611 3011.464 3017.255 3104.214 3252.223    0
## GBM 2824.746 2893.493 3002.380 3008.390 3104.141 3265.364    0
## 
## RMSE 
##         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## RF  6456.857 7466.983 9339.638 10320.01 13464.10 15724.53    0
## GBM 6395.026 7497.567 9331.405 10312.53 13467.22 15733.85    0
## 
## Rsquared 
##            Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## RF  0.009321880 0.01280049 0.02741774 0.02701972 0.03518014 0.05725424
## GBM 0.008359974 0.01409622 0.02910411 0.02799842 0.04020255 0.04894332
##     NA's
## RF     0
## GBM    0
dotplot(results)

6. Finalize Model and Present Results

Once we have narrow down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as:

6.a) Predictions on validation dataset

predictions <- predict(fit.final2, newdata=xy_test)
print(RMSE(predictions, y_test))
## [1] 13007.02
print(R2(predictions, y_test))
## [1] 0.01880234

6.b) Create standalone model on entire training dataset

startTimeModule <- proc.time()
library(gbm)
## Loaded gbm 2.1.4
set.seed(seedNum)
finalModel <- gbm(targetVar~., data=xy_train, n.trees=100, shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, verbose=F)
## Distribution not specified, assuming gaussian ...
summary(finalModel)

##                                                         var    rel.inf
## kw_avg_avg                                       kw_avg_avg 43.8227737
## self_reference_max_shares         self_reference_max_shares 14.8639662
## kw_avg_min                                       kw_avg_min 12.2598927
## self_reference_min_shares         self_reference_min_shares  6.2298674
## num_hrefs                                         num_hrefs  5.1769656
## kw_max_min                                       kw_max_min  3.3980851
## LDA_03                                               LDA_03  2.8387648
## num_videos                                       num_videos  2.0247700
## num_imgs                                           num_imgs  1.3504980
## kw_min_avg                                       kw_min_avg  1.2892783
## avg_negative_polarity                 avg_negative_polarity  1.0864920
## LDA_00                                               LDA_00  0.9985376
## LDA_02                                               LDA_02  0.8375696
## global_subjectivity                     global_subjectivity  0.7645784
## title_sentiment_polarity           title_sentiment_polarity  0.6722912
## n_tokens_title                               n_tokens_title  0.6221047
## max_negative_polarity                 max_negative_polarity  0.4932986
## avg_positive_polarity                 avg_positive_polarity  0.4738195
## kw_avg_max                                       kw_avg_max  0.4334553
## weekday_is_saturday                     weekday_is_saturday  0.3629912
## average_token_length                   average_token_length  0.0000000
## num_keywords                                   num_keywords  0.0000000
## data_channel_is_entertainment data_channel_is_entertainment  0.0000000
## data_channel_is_tech                   data_channel_is_tech  0.0000000
## data_channel_is_world                 data_channel_is_world  0.0000000
## is_weekend                                       is_weekend  0.0000000
## LDA_01                                               LDA_01  0.0000000
## LDA_04                                               LDA_04  0.0000000
## global_sentiment_polarity         global_sentiment_polarity  0.0000000
## global_rate_positive_words       global_rate_positive_words  0.0000000
## global_rate_negative_words       global_rate_negative_words  0.0000000
## rate_positive_words                     rate_positive_words  0.0000000
## min_negative_polarity                 min_negative_polarity  0.0000000
## title_subjectivity                       title_subjectivity  0.0000000
## abs_title_sentiment_polarity   abs_title_sentiment_polarity  0.0000000
proc.time()-startTimeModule
##    user  system elapsed 
##   1.821   0.000   1.841

6.c) Save model for later use

#saveRDS(finalModel, "./finalModel_Regression.rds")
proc.time()-startTimeScript
##      user    system   elapsed 
## 41566.508    60.999 42083.938
email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3cda1055}"